DECLARATION OF MARY FARIS

I, Mary Faris, declare as follows: 

1. I am currently a Group Leader at Agensys, Inc., the assignee herein. Prior to my
employment at Agensys, I was a Senior Scientist at Incyte Genomics. I have a Ph.D. in 
Immunology and Microbiology from Ohio State University and have held post-doctoral 
fellowships at the University of Virginia and the University of California at Los Angeles, School 
of Medicine. While at Incyte, I had considerable experience in expression analysis of cellular 
mRNA using chips with multiple probes. A copy of my curriculum vitae is attached as 
Exhibit A. 

2. I am aware that a question was raised as to the substance of Figure 7 that appeared 
in an article by Oh, J.M.C., in Proteomics (2001) 1:1303-1319. A copy of this figure, which is in 
color, is attached as Exhibit B and a copy of the article itself is attached as Exhibit C.

3. The Oh, et al. article discusses a database for use in analysis of protein expression
in lung cancers. The authors identify proteins or "spots" that are differentially regulated in 
different stages of lung cancer. An unidentified number of the samples analyzed for protein 
expression and included in the protein database were also analyzed for RNA expression using 
microarray technology. Keeping in mind that protein synthesis is dependent on mRNA 
transcription, then the presence of a specific protein indicates that the corresponding mRNA 
must have been present at the same or earlier time point. Since the data shown in Figure 7 is 
initiated from a protein based approach, specifically by asking which of the proteins in the 2D 
gel database have detectable corresponding mRNA by microarray analysis, then a correlation 
between protein and RNA expression would be expected. 

4. The explanation of Figure 7 in the Oh, et al. article is quite brief. It is discussed 
only at pages 1316-1317 in § 5.2. As stated in § 5.2 and as confirmed in the Figure legend, 
Figure 7 consists of 30 columns, each representing a protein spot obtained on a 2D gel. For each 
column, there are 200 entries, one per row, each representing mRNA probes on Affymetrix' 
HuFL chips measuring correlation or anti-correlation with the respective protein. The Figure 
provides 30 discrete sets of data, one set per protein, arranged side-by-side. Thus, the Figure 
represents selected, detectable proteins on 2D gel and whether a quantitatively similar amount of 
RNA corresponding to that protein was expressed at one point in time. 

5. I cannot identify from the article which 30 proteins are represented; it appears 
from the explanation in § 5.2 that some of them may be unidentified. Thus, I believe that this 
Figure is intended to reflect the data presentation concept set forth by Oh et al. 

6. I am familiar with the Affymetrix' HuFL chips, and understand that they contain 
mRNA corresponding to a variety of proteins. Thus, for each individual protein column, at most 
only a subset of the probes present on the HuFL chip would even be expected to hybridize with
mRNA that actually encoded the protein. 

Serial No. 09/389,000
Docket No. 511582002700

7. As explained in the Figure 7 legend, the level of RNA to protein correlation is
represented in color, with red being a near-perfect correlation, green being a negative correlation,
and black is not defined. The data from Figure 7 appears to provide data for two groups of
proteins. The mRNA that are near-perfectly correlated with one group are generally 
anticorrelated with the other group, as would be expected. This expression pattern correlates to 
the general quadrants in the Figure. 

8. It appears that in all cases there is some mRNA for which a high correlation is 
found. This data actually supports the assertions being made in the present case concerning the 
qualitative correlation of RNA to protein: In all cases RNA existed that highly correlated with 
the existence of the protein (some RNA is simply unrelated to this protein). Thus, each protein 
perfectly correlated with existence of relevant RNA. 

I declare that all statements made herein of my own knowledge are true and that all 
statements made on information and belief are believed to be true; and further, that these 
statements are made with the knowledge that willful, false statements and the like so made are 
punishable by fine or imprisonment or both, under Section 1001 of Title 18 of the United States 
Code and that such willful false statements may jeopardize the validity of the application or any 
patent issued thereon. 

Executed at Santa Monica, California, on 9 April 2003 

Figure 7. Correlation matrix of 30 protein spots (columns)
with mRNA levels as measured by 200 probe-sets on
Affymetrix HuFL chips. The correlation coefficients are
depicted with colors, bright red being near-perfect correlation
(r = 1) and bright green anticorrelation (r = -1). The
figure was made using the TreeView software
(rana.lbl.gov/EisenSoftware.htm).
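
The layout described in this legend, one correlation coefficient per pairing of a probe-set with a protein spot, can be illustrated with a minimal sketch. It assumes that every protein spot and every probe-set has one abundance value per sample and that the coefficient is the ordinary Pearson r; the legend does not state how the coefficients were computed, and the function names and toy numbers below are hypothetical and are not taken from Oh et al.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # r is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(probe_levels, spot_levels):
    """One row per probe-set and one column per protein spot."""
    return [[pearson(probe, spot) for spot in spot_levels]
            for probe in probe_levels]


# Toy data (hypothetical): 2 protein spots and 3 probe-sets over 4 samples.
spots = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
probes = [[2.0, 4.0, 6.0, 8.0], [8.0, 6.0, 4.0, 2.0], [1.0, 3.0, 2.0, 4.0]]
matrix = correlation_matrix(probes, spots)
```

With the dimensions given in the legend, the same function would return 200 rows of 30 values each, with entries near 1 corresponding to the red cells and entries near -1 to the green cells.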

A database of protein expression in lung cancer

We have developed a comprehensive approach to identifying molecular changes in
lung cancer that includes both genomic and proteomic analyses. The related effort
has produced a large amount of data pertaining to gene expression at the RNA and
protein levels. As a result, we have constructed a database that contains protein
expression data on lung cancer as well as other relevant data including DNA
microarray derived data. A large number of proteins that are expressed in different types
of lung cancer have been identified and have been correlated with the expression
measures for their corresponding genes at the RNA level. The database is intended to
facilitate our effort at developing novel classification schemes for lung cancer and the
identification of novel markers for early diagnosis.

1 Introduction

There is substantial interest in implementing novel and
comprehensive strategies for the molecular analysis of
tumors and relevant biological fluids. We have implemented
a strategy for the molecular analysis of lung cancer
that integrates genomic analysis using genome scanning
procedures, transcriptomic analysis using cDNA and
oligonucleotide microarrays, and proteomic analysis. For
the latter, we have relied to date primarily on 2-D
polyacrylamide gels. However the 2-D gel approach is being
increasingly complemented with additional analyses
using liquid based protein separations and protein
microarrays. While on the one hand proteomic analysis
complements genomic analysis for a global assessment of
gene expression, on the other hand proteomic analysis
uniquely contributes an understanding of protein
post-translational modifications and the location of protein
gene products in subcellular compartments. The scope
of our overall molecular analysis study of lung cancer is
shown in Fig. 1. Important objectives include the development
of novel molecular classification schemes for lung
cancer and the identification of novel markers for the early
detection of lung cancer.

The large body of proteomic and other data we have collected
has necessitated the construction of a database in
which basic and derived data is organized. There have
been relevant related efforts at databasing of 2-D data
by other groups. One such database is the 2DWG Meta-database
of 2-D gel images, which contains 2-D derived
data acquired by a combination of review of results as
well as submissions by investigators [1]. However, to
date there are only three entries found matching the query
for human lung images in the 2DWG Web Gel Meta-database
web site (http://www-lecb.ncifcrf.gov/2dwgDB).
The database we have constructed, in its entirety, is relevant
to a variety of cancers. However the focus of this
review is the use of the database to achieve our objectives
related to the molecular analysis of lung cancer
specifically. The goal of the database is to facilitate
planned analyses, i.e. statistical analysis, as well as
post-planned analyses, i.e. data mining. The intent is to
make the database queryable on a protein-by-protein
basis as well as through subgrouping of samples analyzed
in a menu driven fashion. Internet and WWW technologies
are used not only to allow investigators to view
visual and textual data together, but also to allow investigators
in other locations to retrieve archival data using
different computer systems.

Correspondence: Dr. S. Hanash, University of Michigan Medical
[...] W. Medical Center [...]
Research Building I, Ann Arbor, MI 48109-065[...], USA
E-mail: shar^eumich.edu
Fax: +1-734-647-8148

Abbreviation: LIPS, Laboratory information processing system
© WILEY-VCH Verlag GmbH, 69451 Weinheim, 2001



2 Laboratory information processing system

A long-standing Laboratory information processing system
(LIPS) developed by our group [2] has been adapted
for our database. LIPS consists of multiple systems and
processes. A variety of data is stored in a variety of formats
with individualized programs for viewing the data. Typical
processes using LIPS include: sample inventory; digitize
images; detect and quantify spots; match spots and normalize
spot sizes across images; choose spots for MS
analysis; enter profiles from MS-Fit web search; transfer
data to statistical software or spreadsheets.

Data tend to be complex and dynamic in that their contents
are ever changing as information is added, modified
or removed. Simple or intensive analyses of 2-D patterns
have produced a large amount of data. Data is both textual
(e.g., reports and numbers) and visual (e.g., 1-D and
2-D gel images).

Some types of data are: protein gel images (silver,
modified silver, blots, 35S labeled gels); genome scans;
1-D gel images; spot information; protein names; gene
information from DNA microarrays; MS files and MS-Fit
reports (Word documents); figures (Raster files on the Sun
and actual photographs); data from protein microarrays;
data from liquid chromatography separations.

However, as computer technology has evolved, quantum
jumps in improvements in organizing unstructured scientific
data into a structured database have become possible.
A major function of our database and its interfaces is to
serve as a computer-based tool for capturing the basic
quantitative data from 2-D gel images, and derived data
and findings derived from different studies about proteins
detected in 2-D patterns of various tumor types [3]. As a
result investigators are provided with easy access to data
as well as a means for intelligent data mining of the existing
data. A logical view of the database schema is shown in
Fig. 2 and a list of tables and their attributes are shown in
Table 1.

Figure 1. Methods and goals of lung cancer studies.

The following are important features of the 2-D gel related
component of our lung protein database:

(1) All 2-D gel images are placed in hierarchies so that: (a)
every such image is matched to one master image, i.e. all
adenocarcinoma tumor images are matched to one
master image; (b) every master image is matched to at
most one (higher) master image, i.e. all masters for different
lung tumor types are matched to one tumor master.
This allows the database to have an indexing mechanism
that can relate a spot to any gel in the hierarchy. The database
provides a capability to access the basic and
derived data using the following types of queries: (a) given
a spot on any gel, find all spots that are matched to it; (b)
given a spot on any gel, find all protein identifications
made for it; and (c) given a spot on any gel, find all
findings/conclusions that are linked to it.

(2) All samples (and thereby gels derived from them) are
identified by a list of source characteristics in four major
categories: experiment code; cell type code; treatment
code; and fraction code. This allows the database to
have an identification mechanism that can relate a gel to
any source in the hierarchy. The database provides a capability
to find all images as follows: (a) given a category,
find all images that have the same value of the category;
and (b) given any combination of four categories, find all
images that satisfy the condition.

(3) All protein spots are identified by a list of characteristics
in four major attributes: protein name; pI and Mr;
accession number; and protein sequence data. A spot
may have several findings and there may be many kinds
of findings derived from a particular study. If possible the
findings are recorded in a consistent way; however, this is
not always possible due to some characteristics of such
findings (e.g., statistical analysis matrices, MS data, and
Affymetrix data). As the number of studies has increased,
the amount of data produced has increased. Some of the
data (e.g. mass spectra and Affymetrix (Santa Clara, CA,
USA) oligonucleotide chip readouts) is very large, and fills
up the hard disks of the computers where it is collected.
Such data is generally saved on CD-Rs, and only the most
recent data is kept in a computer. It is sometimes easier
to post individual files on the web. Individual web pages
have been created with textual and visual data that are
difficult to relate in a table. This allows investigators an
opportunity to analyze 2-D gel and other images containing
spots that have not been detected or identified and to
compare data across studies. In addition this is used to
link our data to other biological knowledge repositories
such as GenBank, PIR International, and SWISS-PROT.

Figure 2. A logical view of database schema.

Table 1. A list of tables and their attributes in the lung protein database

Table name              Unique identifier     List of attribute types
                        (Primary Key)

Project                 Project Name          Project Type, Description
Sub Project             Sub Project Name      Date Started, Comment
Subject                 Subject ID            Case No, Sex, Birthdate, Comment
Tissue Sample           Tissue Sample ID      TrueType, Diagnosis, Date Sample Taken, Date Received,
                                              How Received, Source, Comment
DNA Sample              DNA Sample ID         Date Produced, Concentration, Freezer Location, Comment
Gel                     Gel ID                Sample ID, Batch ID, Enzyme Combination, Electrophoresis
                                              Process, Comment
Image                   Image Name            Date Imaged, Exposure Time, Image Type, Image Location,
                                              Comment
Spot                    Image Name & Spot No  X, Y, Intensity, Spot Type
Match                   Match ID              Master Image Name, Master Spot No, Image Name, Spot No
Experiment              Experiment Code       Description
Cell Type               Cell Code             Description
Treatment               Treatment Code        Description
Fraction                Fraction Code         Description
Protein Sample          Sample ID             Experiment Code, Cell Code, Treatment Code, Treatment Date,
                                              Fraction Code, Comment, Project Type, Gel ID, Image Name,
                                              Image Type, Researcher
Protein                 Protein Name          Image Name, Spot No
Other Link              Protein Link ID       Protein Name, Database Name, URL
Findings                Image Name & Spot No  Category, Designation, Finding
Protein Identification  Image Name & Spot No  Accession No, cDNA cloning, Cell Lines, Facility, Date,
                                              Gene Cloning, Glycosylate, Mr, pI, Phosphorylation,
                                              Phosphorylation Residues, Related Spot, Sequences,
                                              Source of Protein, Name, Structural Modification, Subcellular
                                              Localization, Tissue Distribution, Type of Membrane,
                                              Type of Sequencing
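
The master-image hierarchy of item (1) and its first two query types can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the table layout loosely follows the Match attributes listed in Table 1 (Master Image Name, Master Spot No, Image Name, Spot No), but the table and column names, the gel names, the spot numbers, and the use of a recursive query are all hypothetical and are not taken from the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE spot_match (
    master_image TEXT NOT NULL,
    master_spot  INTEGER NOT NULL,
    image        TEXT NOT NULL,
    spot         INTEGER NOT NULL,
    PRIMARY KEY (image, spot)      -- a spot has at most one master spot
);
CREATE TABLE protein_identification (
    image        TEXT NOT NULL,
    spot         INTEGER NOT NULL,
    protein_name TEXT NOT NULL
);
""")

# Hypothetical hierarchy: gels -> tumor-type masters -> one tumor master.
conn.executemany(
    "INSERT INTO spot_match VALUES (?, ?, ?, ?)",
    [
        ("ADENO_M", 10, "ad01", 3),
        ("ADENO_M", 10, "ad02", 7),
        ("ADENO_M", 11, "ad01", 4),
        ("SQUAM_M", 5, "sq01", 12),
        ("TUMOR_M", 100, "ADENO_M", 10),
        ("TUMOR_M", 100, "SQUAM_M", 5),
    ],
)
conn.execute(
    "INSERT INTO protein_identification VALUES (?, ?, ?)",
    ("ad02", 7, "Annexin V"),
)

# Climb from the given spot to its top-level master, then collect every
# spot that hangs below any spot on that path.
_FAMILY_CTE = """
WITH RECURSIVE
  up(image, spot) AS (
    SELECT :image, :spot
    UNION
    SELECT m.master_image, m.master_spot
    FROM spot_match AS m JOIN up ON m.image = up.image AND m.spot = up.spot
  ),
  family(image, spot) AS (
    SELECT image, spot FROM up
    UNION
    SELECT m.image, m.spot
    FROM spot_match AS m JOIN family AS f
      ON m.master_image = f.image AND m.master_spot = f.spot
  )
"""


def matched_spots(conn, image, spot):
    """Query type (a): all spots matched to the given spot."""
    sql = _FAMILY_CTE + """
    SELECT image, spot FROM family
    WHERE NOT (image = :image AND spot = :spot)
    ORDER BY image, spot
    """
    return conn.execute(sql, {"image": image, "spot": spot}).fetchall()


def identifications(conn, image, spot):
    """Query type (b): identifications recorded for the spot or its matches."""
    sql = _FAMILY_CTE + """
    SELECT DISTINCT p.protein_name
    FROM family AS f JOIN protein_identification AS p
      ON p.image = f.image AND p.spot = f.spot
    ORDER BY p.protein_name
    """
    rows = conn.execute(sql, {"image": image, "spot": spot})
    return [row[0] for row in rows]
```

In this toy data, a spot on the squamous gel `sq01` inherits the identification recorded on the adenocarcinoma gel `ad02`, because both are tied to the same spot on the top-level master. Query type (c) would follow the same pattern with a findings table in place of `protein_identification`.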



3 Contents of the lung cancer protein database

A large number of studies involving lung cancer have been
independently performed in the laboratory. At the protein
level, these studies have resulted in 1349 images, over
1000 of which are images of 2-D gels for which information
has been recorded in the lung protein database. This number
represents a fraction of the 30 682 2-D gels produced by
our group for different studies, which include studies of other
cancer types encompassing head and neck, esophagus,
liver, colon, pancreas, ovary, breast, prostate, brain,
leukemias and childhood tumors. A list of protein gel images
related to lung studies is shown in Table 2. While lung
adenocarcinomas represent a major portion of the database,
other lung tumor types including squamous cell carcinomas
and small cell lung cancers are represented, as are control
lung tissues. Other 2-D patterns were produced from
studies of cell lines that have been manipulated by transfection
or by treatment with specific agents, as well
as patterns produced after different cell fractionation
schemes. Substantial emphasis is currently being placed
on the comprehensive profiling of lung cancer derived
surface membrane proteins.

Table 2. A high-level categorization of lung protein 2-D
images by sample type

Lung Sample Types
Cell Lines            421
Cystic Fibrosis        44
Tumor                 635
Normal                170
Plasma                 61
Other                  18
Total                1349

Mass spectrometry and/or N-terminal sequencing of protein
spots from 2-D gels of lung tumor samples or cell
lines have led to the identification of a large number of
proteins expressed in lung cancer. Also, most identifications
made for proteins from a sample type can often
be confidently transferred to matching protein spots on
master images from lung studies. Table 3 and Fig. 3 exhibit
some of the progress we have made in identifying
proteins in 2-D gels of lung samples.

Figure 3. Small cell lung tumor master containing identified
proteins.
Name
ID
Source 



L95 



LM 
LM 

L95 

DMS79 



LM 
LM 
LM 

SKMES 



LM 
L95 
L95 

LM 
LM 
A549 
LM 
LM 
L95 

LM 

SKMES 
LM 

DMS79 
LM 

DMS79 

LM 
LM 
A549 
A549 
L95 

DMS 79 
LM 

LM 



Spot # NCBI GenBank
Accession Number
Number



pI



Official 
gene 



(spot 1496L) possibly 
pancreatitis-associated

protein 
14_3_3_sigma
14_3_3_ZetaDelta

14-3-3η

6PF-2-K/FRU-2,6-P2ASE

Liver isozyme
ADP-ribosylation factor 1 
Albumin 
Albumin 
Albumin 

Aldehyde Dehydrogenase 
AldoKeto Reductase
Alkaline Phosphatase, 

Placental type 1 precursor 
Albumin 
Albumin 
α-Enolase
α-helical protein
α-helix coiled-coil rod
homologue
α Tubulin
Amyloid B4A
Annexin I
Annexin V
ApoAI

Apoprotein, pulmonary 

surfactant 
ARF1 

β-Galactoside soluble lectin
β-Actin
β-spectrin
β Tubulin

Calmodulin dependent
phosphodiesterase

Calreticulin 
Calreticulin32 
CGI-46 protein 
Chaperonin-like protein
Clathrin light chain A
Collagen, type XV, α 1

Complexin II
Cellular retinoic acid-
binding protein 2
Cellular retinol-binding
protein 1, CRPB1

Creatine kinase, brain
Cytochrome C oxidase VA

Cytokeratin 8



1496 



577 
615 
1279 
24 

928 
319 
800 

207 
543 
14 

666 
693 
332' 
268 
268 

172 
802 
460 
522 
685 
1278 

795 

96 

349 

61 

229 

22 



104 

469 

36 

149 

1338 

789 

85 

856 

855 

439 
872 
321 



398953 
112695 
437363 
2507173 

4502201 



P31947 
P29312 
AV&54S3 
P15118 



4.351 
4.569 



4502031 
3493209 
130737 



NPJ301649 6.31 



5.957 

NP_CO0680 6.811 
AAC36469 7.812 
P05187 5.86 



4503571 


NP_001419 


7.742 


45.407 


8272482 


AAFZ4221 






5360901 


BAA82153 






5174477 


NP.C06073 


5.099 
4.796 


52.848 
17.194 


113944 
4502107 


P04033 
NP.001145 


6.73 
4.83 
5.124 


39.264 
33.326 
25.4 


71967 


LNHUPS 







30.052 SFN 
29.101 YWHAZ 
YWHAH 
PFKFB1 

20.697 ARF1 
ALB . 
ALB 

70.244 ALB 
56.966 ALDH1A1 
32.379 AKR1B10 
57.954 ALPP 



ALB 
ALB 
ENOI 
HCR 







6.33 


227920 


1713410A 


5.34 


113270 


P02570 


5.29 


29497 


X59511 


4.75 


4507729 


NP_00106O 


11995077 


AB033211 




4757900 


NP_004335 


3.668 




3.442 


4929561 


AAD34041 


6.25 


4502543 


NPJQ01753 


7.034 


4502899 


NP_001824 




1362772 


E57233 


5.415 




4506451 


NP.002890 


4.667 


180570 


AAC31758 


5.34 




4.568 


1673575 


U76549 





ANXA1 
ANXA5 



18.909 ARF1 
14.584 LGALS1 

41.7 ACTB 
SPTB 

49.8 TUBB 



57.29 CALR 

48.772 

49.296 

60.547 CCT6A 
CLTA 
COL15A1 
CPLX2 

11.858 CRABP2 

10.297 RBP1 

42.618 CKB 
9.2 

KRT8 
Table 3. Continued



ID 

Source 



A549 
A549 
LM 
A549 



LM 
LM 

DMS 79 

LM 

L95 

LM 
LM 



LM 
A549 

DMS 79 
LM 
LM 
LM 

FMD 79 



LM 
LM 
LM 



A549 
A549

LM 
LM 
L95 
L95 
L95 
A549 
L95 
L95 



Name 



Cytokeratin 8
Cytokeratin 8

Cytokeratin 15, keratin 15

Dihydrolipoamide dehydro- 
genase, mitochondrial 
precursor 

DJ1 

DJ1_MER5 
dj475N16.1 (CTG4A) 
dUTPase

E2 ubiquitin-conjugating 

enzyme 
EIF4d 
EIF5A 

Enhancer of rudimentary 
(Drosophila) homolog
Enolase 2 (γ, neuronal)
ENPL.HSP100
F1F0-type ATP synthase

subunit d 
G1/S specific cyclin E1 
G3PD 



Spot # NCBI 

Accession 
Number 



γ-Actin
Glyoxalase I

Granulocyte-macrophage 
colony-stimulating factor 
precursor 

GRP75 

GRP78 

GSTpi 

Heat shock 27 kD protein 1 
Heat shock 27 kD protein 1 
Heterogeneous nuclear 

ribonucleoprotein H
HLA-B71 or HLB-B71

variant 
HSC70_HSP73 
HSP90 
HSPC089 
HSPC321 
HSPC321 
HuCha60SP60 
Huntingtin associated protein 
Huntingtin associated protein
Internexin neuronal intermediate

filament protein, alpha 
Keratin 17 
KIAA1610 protein 

LamR 



446 
439 
289 
759 



811 

700 

57 

769 

1445 

718 
839 
902 

295 
18 

1519 

31 
540 
348 
650 

' 86 



2506774 
2506774 
4504915 
118674 



6005749 
6969163 
4885417 



87 
79 
690 
626 
631 
457 

818 

120 
46 

1036 
1547 
1548 
181 
1595 
1548 
183 



119347 

5453559 

3041657 

113278 
417246 
117561 



726098 
123571 
123571 
5031753 

511776 



6841118 
6841292 
6841292 
4504521 
1708113 
1708113 
6225015 



GenBank 
Number 

P05787 
P05787 
NP_002266 
P09622 



NP_009193 

CAB75301 

AB022435 



P09104 

NP.0063475 

P24864 

P02571 
Q04760 
P04141 



AAC13869 
P04792 
P04792 
NP.005511 



U11269 



5.52 
5.52 
4.153 



6.44 
6.263 

5.719 



5.104 
4.599 



4.94 

4.945 

5.21 



7.457 
5.146 
4.833 



5.9341 

5.187 

5.5 

7.83 

7.83 



5.55 

5.893 
5.276 



Official 
gene 



AAF28912 
AAF28999 
AAF28999 
NP_002147 
P54255 
P54255 
Q16352 



5.7 



53.674 KRT8 
53.674 KRT8 
49.261 KRT15 



21.015 
24.001 

20.136 



22.961 
10.957 



DJ1 



HIP2 



ERH 



EN02 



47.286 
78.717 

18.491 ATP5JD 
CCNE1 

31.772 

42.315 ACTG1 
25.572 GL01 
CSF2 



73.124 
68.109 

25.4 GSTP1 
22.327 HSPB1 
22.327 HSPB1 
HNRPH1 



36.558 

72.429 
76.096 



61 HSPD1 
HAP1 
HAP1 
54.908 INA 

KRT17 
LM

A549 

A549 
LM 



L95 

DMS79 



DMS 79 



LM 
A549 



LM 

LM 
LM 
LM 
LM 
LM 
L95 
L95 
L95 



L95 
A549 

A549 



A427 



L95 



L95 
L95 



Spot # NCBI 

Accession 
Number 



Lectin, galactoside-binding, 

soluble, 1 (galectin 1) 
Lipocortin 

L-Lactate Dehydrogenase 

H chain 
L-lactate dehydrogenase 

H chain (LDH-B) 
LaminB 

Lymphocyte cytosolic 
protein 1 (L-plastin) 
Macropain subunit zeta 

MHC class 1 
histocompatibility

antigen protein 
Multicatalytic endopeptidase
complex chain C2,
long splice form
MyosinLightChain3
Nm23, NDPKA
Non metastatic cells 1,

protein (NM23A)
Op 18, leukemia-associated
phosphoprotein p18 (stathmin)

Op 18a 
Op 18m 

Phosphoglycerate MutB 
Phospholipase C 
PIMT 

pinch-2 protein 
Pinch-2 protein 
Possibly activin type II 

receptor precursor; 

DNA polymerase epsilon 

subunit B; or ITF-i DNA 

binding protein 
Possibly BTF2p44 
Possibly carbonic anhydrase III

or UCH-L1; PGP9.5

Possibly Δ3,5-Δ2,4-
Dienoyl-CoA isomerase
precursor
Possibly G1 to S phase
transition protein; serine-
threonine phosphatase
protein; or phosphatase

5 protein
Possibly GCF2 fusion
protein or Bamacan
homolog 
Possibly glycosyltransferase 

Possibly HLADQ 



873 227920 



460 
905 

906 

924 
924 

1338 
33 



113944 
126041 



4506187 
1236790 



74 346314 



815 

1456 

793 

809 

807 

808 

639 

248 

662 

1695 

1825 

627 



1496 
1242 

2138 



321 



320 



1519 
1271 



171 341 OA 

PO4083 
P07195 



4557032 NPJD02291 



4504965 NP_002289 



5.34 
6.73 



5.737 
5.20 



NP_002289 
U06487 



JC1445 



14.584 LGALS1 

39.264 ANXA1 
LDHB 

. LDM3 

69.625 

70.290 LCP1 

PSMAS 



6.51 



30.239 







4.11 


15.172 




127981 
4557797 


P15531 
NP_000260 


5.809 
5.83 


19.216 
17.148 


NME1 


5031851 


NP_005554 


5.783 


17.164 


LAP18 


5031851 
5031851 


NP_005554 
NP_005554 


4.962 

5.302 

7.083 

5.7 

6.211 


13.655 

14.857 

27.227 

56.5 

25.804 


LAP18 
. LAP18 


9800509 
9800509 


AAF99328 
AAF99328 
ID

Source 



A549 

L95 
L95 
L95 



L95 
A549 

L95 



L95 



LM 
LM 
L95 

L95 

LM 

LM 



Name 



Spot # NCBI

Accession 
Number 



GenBank 
Number 



Mr 



Official 
gene 



DMS 79 



Possibly hydroxyacylglutathione
hydrolase or B-lymphocyte
Antigen CD20
Possibly microtubule-based

motor protein
Possibly putative novel
protein similar to HPS
Possibly Spi-B; unnamed
protein product (AK001844);
or protein kinase (y15801) 
Possibly T-complex protein 
Possibly U1 small nuclear 

ribonuclear protein A 
Possibly unnamed protein 
product (AK000369) or 
syntaxin 
Possibly unnamed protein
product or Pro0282p
protein
procollagen-proline,
2-oxoglutarate 4-dioxygenase
(proline 4-hydroxylase), beta
polypeptide (protein disulfide
isomerase; thyroid hormone
binding protein p55)
proliferating cell nuclear

antigen
Protein phosphatase 2
(formerly 2A), regulatory
subunit A (PR 65), β-isoform
Protein H precursor 
Protein kinase C inhibitor 1 
Pulmonary surfactant 

apoprotein precursor
Pulmonary surfactant- 
associated protein 
R33729J 

Retinol-binding protein 1, 

cellular 
RoSS_A Antigen
S100 calcium-binding

protein A11 (calgizzarin)
S100 calcium-binding

protein A8 (calgranulin A)
S100 calcium-binding

protein A9 (calgranulin B)
Serine/threonine protein

phosphatase 2A, 65 kDa

regulatory Subunit A,

β isoform
SET translocation (myeloid 



1080 

1438 
1427 
1187 



630 
1148 

1064 



1351 



110 


2507460 


P07237 


4.76 


57.116 


P4HB 


515 


129697 


P17070 


4.4 


37.5 


PCNA 


104 


5915685 


P30154 


4.84 


66.202 


PPP2R1B 


40 

882 

1278 


4885413 
190565 


NP_005331 
AAA36510 


3.714 
7.714 


62.182 
11.521 


HINT 
SFTPA1 


1278 


131412 


P07714 






SFTPA1 


848 
855 


3355455 
4506451 


AAC27824 
NP_002890 


7.508 
4.99 


13.163 
15.850 


RBP1 


69 
906 






3.215 


47.903 


S100A11 


910 


115442 


PO5109 


6.51 


10.834 


S100A8 


931 


6094219 


P50117 


6.37 


13.291 


S100A9 


14 


5915686 


P30154 


4.84 


66.202 


PPP2R1B 


376 


1711383 


Q01105 


4.12 


32.103 


SET 
ID

Source 



Name 



LM 
LM 



LM 
LM 

LM 

LM 

LM 

A549 

L95 

LM 
L95 
L95 

DMS79 



A549 



L95 
A549 

LM 
LM 
LM 

A427 
A549 
A549 



Spot #



476 



Small glutamine-rich 

tetratricopeptide repeat

(TPR)-containing
Stratifin

Superoxide dism CuZn
Superoxide Dism MN,

superoxide dismutase 2,

mitochondrial
TCP 1 β subunit
TCTP (translationally-controlled

tumor protein 1)
Thioredoxin
Tplastin HSP 70 
Transthyretin 

Triosephosphate isomerase 
Tropomyosin, cytoskeletal 

type (tropomyosin 5)
Tropomyosin 4
Troponin T
Troponin T
Tubulin, β polypeptide
Tumor associated hydroquinone 34

(NADH) oxidase tNOX
tyrosine 3-monooxygenase/
tryptophan 5-monooxyge-
nase activation protein,
epsilon polypeptide
tyrosine 3-monooxygenase/ 
tryptophan 5-monooxy- 
genase activation protein, 
zeta polypeptide 
Tyrosine 3-monooxygenase/ 
tryptophan 5-monooxy- 
genase activation protein, 
theta polypeptide 
Ubiquitin carboxyl-terminal 
esterase L1 (ubiquitin 
thiolesterase), UCH-L1; 
PGP 9.5, GSTmu 
Unnamed protein product 
Urokinase plasminogen 

activator 
Vid1 
Vid2 
Vid4
Vimentin 
Vimentin 
Vimentin 
Vimentin 



NCBI 
Accession 
Number 

8134553 



GenBank 
Number 



043765 



896 
125 
842 
672 
550 

548 
866 
778 
229 



576 



615 



579 



656 



1270 
842 

293 
294 
337 
294 
606 
505 
47 



136050 
136096 

132744C0 

408217 

408217 

4507729 

6644157 

11681S3 



112695 



112690 



136681 



7023092 
437123 



4507894 

418249 

340234 



P00938 
P12324 

AAK17926 

AAB27731 

AAB27731 

NP_001060 

AF207881 

P4266 



P29312 



P27348 



P09936 



BAA91833 
S39495 



pI



4.81 



Mr 



Official 
gene 



577 


398953 


P31947 


4.68 


792 


134811 


P00441 


5.6 


737 


134565 


P04179 


7.887 


202 






5.89 


680 


4507659 


NP_003286 


4.688 



4.689 

5.862 

5.693 

7.2 

4.5 

4.377 



4.78 



4.63 



4.73 



4.68 



6.01 

4.712 
4.614 
4.464 



NMJJ03380 

PO8670 

M25246 



34.063 SGT 



27.774 SFN 
17.3 SOD1 
20.78 SOD2 



59.841 

25.143 TPT1 



9.207 

68.909 

14.714 

25.5 

31.9 



TPI1 



32.733 TPM4 



49.907 TUBB 



29.174 YWHAE 



26,645 YWHAZ 



27.764 YWHAQ 



5.283 27.745 UCHL1 



31.263 

47.485 
46.369 
45.322 



VIM 
VIM 
VIM 
VIM 

In addition to 2-D gel analysis, most lung adenocarcinomas
are examined at the genomic level using restriction
landmark genome scanning, and by mutation analysis
for a small number of genes. Transcriptomic analysis is
done primarily using oligonucleotide microarrays, as part
of our efforts to derive a molecular based classification of
lung adenocarcinomas that is more predictive of clinical
behavior for this group of tumors than current classification
schemes. We also have similar molecular analyses
of control lung tissue obtained from multiple sources
including adjacent lung tissue from lung cancer patients
as well as tissues obtained from non-cancer resected lung.

Only a fraction of the information in the 2-D patterns has
been linked across all studies and analyses. The lung protein
database contains the basic descriptive data of various
samples analyzed, the images of the 2-D patterns
that resulted from these samples, the quantitative spot
data and information about which spots have been
matched to each other, and conclusions or findings about
spots. The database is intended to allow not only the
retrieval of existing data, but also to mine new information
and knowledge about protein expression in lung cells.
Data mining activities consist, for example, of reviewing
previous studies and finding out which 2-D gel patterns
and protein spots are interesting for post-planned analysis
and new discoveries. Such discoveries derive from:
(1) identification of proteins that exhibit interesting expression
profiles in 2-D patterns that have been regrouped
from different experiments and studies; (2) expanded
statistical analyses that cover protein expression patterns
involving large numbers of experiments and images;
(3) relating our data involving proteins to outside information;
and (4) relating proteomic data to genomic data.

4 Use of the database for post planned 
analysis 

4.1 Virtual matching 

Interactive software packages are used to automatically detect and
quantify spots and to match spots between different protein
patterns, with visual editing to correct any errors in computer
based matching. The spot match program has created indices that
allow investigators to quickly navigate through many gels and
easily compare spots on images from many different experiments and
studies, discover proteins of interest, and access and view
relevant data. Here the term "match" is used as a "logical
transitive" relation, which means if spot A is matched to spot B
and spot B is matched to spot C, then the spots A and C are
considered matched. The lung protein database contains data on
proteins detected on various 2-D gels. Since all gels derived from
whole cell or tissue lysates in the lung protein database are tied
into a single hierarchy, protein identification data recorded for
a spot is used to derive protein data for its matched spots using
an advanced query capability of the database. This is known as
"virtual matching" or "virtual protein identification", which
allows investigators to access and view all matched images and the
corresponding information from the lung protein database. With a
click on a spot, one gets the result shown in Fig. 4. The virtual
protein identification feature does not provide a 100% level of
certainty of protein identification, but it makes possible the
display of spots of interest. A combination of automated
recognition and manual editing generally yields an accurate record
of protein information in the database for previously unknown
proteins. With this approach, the lung protein database will
evolve and mature to include all correct data for further analysis
and data mining.
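
The transitive "match" relation and the propagation of an
identification to matched spots can be illustrated with a small
sketch. It is not the database's actual query mechanism, which the
text does not describe; the class name SpotMatcher, its methods, and
the "gel:spot" identifiers are all hypothetical, and a union-find
structure is simply one convenient way to model a transitive
relation.

```python
from collections import defaultdict


class SpotMatcher:
    """Union-find over spot identifiers; 'match' is treated as transitive."""

    def __init__(self):
        self._parent = {}

    def _find(self, spot):
        # Register unseen spots as their own group, then walk to the root.
        self._parent.setdefault(spot, spot)
        root = spot
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited spot directly at the root.
        while self._parent[spot] != root:
            self._parent[spot], spot = root, self._parent[spot]
        return root

    def match(self, spot_a, spot_b):
        """Record that spot_a and spot_b were matched (merges their groups)."""
        self._parent[self._find(spot_a)] = self._find(spot_b)

    def matched(self, spot_a, spot_b):
        """True if the two spots are linked by any chain of matches."""
        return self._find(spot_a) == self._find(spot_b)

    def virtual_identifications(self, identified):
        """Propagate known identifications to every spot in the same group.

        `identified` maps spot id -> protein name. The result maps each spot
        in an identified group to the sorted list of names found in that
        group; more than one name signals a conflict to review by hand.
        """
        by_group = defaultdict(set)
        for spot, protein in identified.items():
            by_group[self._find(spot)].add(protein)
        return {spot: sorted(by_group[self._find(spot)])
                for spot in list(self._parent)
                if by_group[self._find(spot)]}
```

Returning a list of names per spot, rather than a single name,
reflects the point made above that a virtual identification is not
100% certain.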

4.2 Integrating protein spot data with MS data 

As interest in proteomic analysis grows, a number of very large
public databases are available to access protein data via the
internet. Public databases offer a sophisticated text search and
keyword search, which links any entered keyword to all protein
information associated with that keyword, to ensure easy access to
all relevant data. Protein identification using MALDI-MS relies on
database searches and usually has three components: (1) peak
detection, which allows automatic determination of peptide masses;
(2) search in protein sequence databases (SWISS-PROT and/or
GenBank) for protein entries that match the masses; and
(3) certainty calculation, which determines the quality of the
match for each protein in the list [4]. An example of such a
software tool is PepFrag, for searching protein and DNA sequence
databases, which can use different types of mass spectrometric
information [5]. Fenyo [6] described methods and software tools in
proteomics for identifying and characterizing proteins that
emphasize MS combined with database searching. Proteolytic peptide
mapping and genome database searching provide an automated means
for identifying proteins, and the certainty of the results is
computed by the number of masses matched for each protein [7].
Another useful tool is FindMod
(http://www.expasy.ch/sprot/findmod.html) for the systematic
characterisation of proteins using mass spectrometry [8].

Figure 4. Virtual protein identification by clicking a spot.

We have created MS data forms that contain information used in
mass spectrometry queries, summary information (Rank, MOWSE score,
% Masses Matched, MW, pI, Species, Accession #, Protein Name) and
additional information (Summary ID, Submitted Mass, Matched Mass,
Delta PPM, Start, End, Peptide Seq, Modifications, Unmatched
Masses). An example of the MS data form is shown in Fig. 5.
Integrating the lung protein database with MS data provides a
record of protein identification and a high level of integration
with other public databases, although substantial effort is
required for data collection. We are currently evaluating an
automated or semi-automated method of pulling these data when new
information, which is relevant to our objectives, is available.
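
As an illustration of steps (2) and (3) of the MALDI-MS
identification scheme described in this section, the sketch below
counts how many observed peptide masses fall within a
parts-per-million tolerance of each candidate's theoretical masses
and ranks the candidates by that count. It is a simplified stand-in,
not the MOWSE score and not the algorithm of PepFrag or any other
tool cited here. The function names, the 50 ppm default, and the
shape of the `database` argument are assumptions made for the
example; the returned fields loosely mirror the "% Masses Matched",
"Delta PPM" and "Unmatched Masses" entries of the MS data form.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def rank_candidates(observed_masses, database, tolerance_ppm=50.0):
    """Rank database entries by the number of observed masses they explain.

    `database` maps a protein name to its list of theoretical peptide masses
    (hypothetical input; a real search would digest sequences in silico).
    Returns (protein, n_matched, percent_matched, unmatched_masses) tuples,
    best candidate first.
    """
    if not observed_masses:
        raise ValueError("at least one observed mass is required")
    results = []
    for protein, peptide_masses in database.items():
        # An observed mass counts as matched if any theoretical peptide
        # mass of this protein lies within the ppm tolerance.
        matched = [m for m in observed_masses
                   if any(abs(ppm_error(m, t)) <= tolerance_ppm
                          for t in peptide_masses)]
        unmatched = [m for m in observed_masses if m not in matched]
        percent = 100.0 * len(matched) / len(observed_masses)
        results.append((protein, len(matched), percent, unmatched))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Ranking by a raw match count follows the statement above that
certainty is computed from the number of masses matched; production
tools weight matches more carefully.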



4.3 Integrating protein data with microarray data 

As technology evolves, new computer aids and methods are
introduced for genomic analysis as well as proteomic analysis.
With respect to DNA microarray platforms, a current goal is to
construct lung specific cDNA microarrays for lung cancer
investigations. In the meantime, RNA expression data for lung
cancer is being collected using an Affymetrix oligonucleotide
based system. This system automates the identification and
quantification of microarray spots. Data files contain integrated
intensities for each spot and ratios showing fold changes per
spot. The use of oligonucleotide based microarrays for RNA
analysis in lung cancer by our group has resulted in a massive
amount of data. Integration of protein information in the lung
protein database with microarray data allows us to extend data
analysis capability to encompass RNA and protein data for a subset
of genes.
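
One way to picture this integration is as a join of the two data
sets on gene and sample, followed by a per-gene comparison of
protein and RNA levels such as the correlation analysis reported in
Section 5.2. The sketch below assumes both data sets have already
been reduced to nested dictionaries keyed by gene symbol and sample
identifier. That layout, the function names, and the minimum of
three paired samples are illustrative choices, not a description of
the actual database schema.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


def protein_mrna_correlations(protein, mrna, min_samples=3):
    """Correlate protein spot and mRNA levels for genes present in both sets.

    Both arguments map gene symbol -> {sample id -> measured intensity}.
    Only samples measured on both platforms contribute to a gene's r;
    genes with fewer than `min_samples` paired samples are skipped.
    """
    correlations = {}
    for gene in sorted(protein.keys() & mrna.keys()):
        shared = sorted(protein[gene].keys() & mrna[gene].keys())
        if len(shared) < min_samples:
            continue
        correlations[gene] = pearson_r([protein[gene][s] for s in shared],
                                       [mrna[gene][s] for s in shared])
    return correlations
```

Restricting each gene to samples measured on both platforms matters
because, as noted above, RNA and protein data are available
together only for a subset of genes and samples.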



5 Some findings derived from the lung 
cancer protein database 

5.1 Unique proteomic pattern of small cell 
lung cancer 

A major goal of our proteomic and genomic studies of lung cancer
is to derive novel classification schemes that have utility in
making a diagnosis, predicting outcome and in making therapeutic
decisions. An important first step in this direction is to
determine the ability of proteomic profiling to distinguish
between known types of lung cancer. Specific protein differences
between different types of cancer have been identified by other
groups. In a recent study of breast, ovary and lung tumors, 20
differentially expressed proteins were identified [9] and in a
prior study, 15 polypeptides were found to be associated with
different histopathological features of lung cancer [10, 11]. In a
study of 25 adenocarcinomas of the lung, 12 small cell lung
cancers, and 16 squamous cell tumors by our group (manuscript
submitted), an initial analysis of protein 2-D patterns uncovered
a group of 52 protein spots that differed in average integrated
intensity between the three groups. Performing simple two-sample
t-tests gave p values of less than 0.05 for the 52 spots for at
least one of the pairs of groups. Most of the spots differed
between small cell and the remaining two diagnostic groups, with
47 spots differing significantly between small cell and
adenocarcinoma groups and 44 between small cell and squamous
(p<0.05). Between the adenocarcinoma and squamous groups, 12 spots
with differences of this significance were found. Summary data for
some of the spots is presented in Table 4. The first two principal
components of the data are graphed in Figure 6 and show that as a
group the spots distinguish small cell tumors from the other two
tumor types fairly easily.

Figure 5. MS data form.
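
The pairwise comparison described above can be sketched for a
single spot as follows. The text says only that simple two-sample
t-tests were used; the equal-variance (pooled) form shown here is
an assumption, as are the function names and the group labels. The
sketch returns the t statistic and its degrees of freedom. A
two-sided p value would then be read from Student's t distribution,
for example with `scipy.stats.ttest_ind`, which is left out so that
the example needs only the standard library.

```python
import math
from itertools import combinations
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's two-sample t statistic (equal variances assumed) and df.

    Each group needs at least two values; the statistic is undefined
    (division by zero) when both groups have zero variance.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df


def pairwise_t(groups):
    """Run pooled_t on every pair of named groups for a single spot.

    `groups` maps a group label to that group's spot intensities.
    Keys of the result are (label_1, label_2) in alphabetical order.
    """
    return {(a, b): pooled_t(groups[a], groups[b])
            for a, b in combinations(sorted(groups), 2)}
```

With three tumor types this yields three comparisons per spot,
which corresponds to the three t-test columns of Table 4.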

We have identified 39 of this set of 52 spots by N-terminal
sequencing and/or MS of spot digests. Small cell lung cancers were
characterized by higher average amounts for some proteins
associated with cell proliferation such as proliferating cell
nuclear antigen (PCNA) and oncoprotein 18 (Op18) [12-15],
particularly the once-phosphorylated form of Op18, as well as
protein products of the UCHL1, RBP1, CRABP2, KRT15, and TUBB genes
among others. Squamous cell and adenocarcinoma samples had greater
amounts of the S100 proteins S100A8, S100A9, and S100A11, as well
as larger average amounts of both the unphosphorylated and
phosphorylated 27 kD heat shock protein (HSPB1). These two groups
also had larger amounts of several protein spots detected on these
gels that did not occur in similar gels made from cell lines and
were thought to be cleavage products from proteins present in
cells or plasma surrounding the tumor cells (e.g. cleaved
albumin). The number of protein spots that differed between lung
adenocarcinomas and squamous tumors was smaller than the number of
proteins that distinguished between small cell lung cancer and the
other two lung cancer types. ENO2 was smallest in the
adenocarcinoma group, while ANXA5 and CKB were lowest and KRT17
and SFN highest in the squamous carcinoma samples. Several
interesting spots found in the study remain to be definitively
identified.

Table 4. 39 identified protein spots found to differ between small
cell, adenocarcinoma, and squamous tumors of the lung. Entries in
the t-test columns are p values from the two-sided two-sample
t-test comparing each pair of groups. Gene = official gene symbol;
AD = adenocarcinoma; SQ = squamous; SC = small cell.

Spot  Gene      Mean AD      Mean SQ      Mean SC      p AD vs SC   p SC vs SQ   p AD vs SQ   Unigene description
294   VIM       1.35         [illegible]  [illegible]  [illegible]  0.015        0.509        vimentin
319   ALB       2.13         [illegible]  [illegible]  0.001        0.005        0.231        albumin
666   ALB       0.72         0.59         [illegible]  [illegible]  0.030        0.461        albumin
800   ALB       2.34         1.80         [illegible]  [illegible]  0.034        0.383        albumin
873   LGALS1    1.95         1.69         0.83         0.000        0.002        0.310        lectin, galactoside-binding, soluble, 1 (galectin 1)
928   ARF1      0.22         0.19         0.06         0.012        0.046        0.607        ADP-ribosylation factor 1
522   ANXA5     0.45         0.26         0.39         0.429        0.202        0.012        annexin A5
515   PCNA      0.15         0.18         0.36         0.002        0.011        0.464        proliferating cell nuclear antigen
577   SFN       0.73         1.39         0.41         0.129        0.002        0.029        stratifin
626   HSPB1     0.37         1.18         0.30         0.000        0.002        0.128        heat shock 27 kD protein 1
631   HSPB1     1.04         1.35         0.46         0.003        0.017        0.277        heat shock 27 kD protein 1
793   NME1      0.35         0.43         0.59         0.003        0.033        0.253        non-metastatic cells 1, protein (NM23A)
807   LAP18     0.03         0.05         0.92         0.000        0.000        0.351        leukemia-associated phosphoprotein p18 (stathmin)
809   LAP18     0.55         0.50         3.83         0.000        0.000        0.732        leukemia-associated phosphoprotein p18 (stathmin)
931   S100A9    0.95         1.18         0.24         0.026        0.001        0.447        S100 calcium-binding protein A9 (calgranulin B)
104   PPP2R1B   0.17         0.13         0.65         0.000        0.001        0.188        protein phosphatase 2 (formerly 2A), regulatory subunit A (PR 65), beta isoform
110   P4HB      0.10         0.10         [illegible]  [illegible]  0.049        0.906        procollagen-proline, 2-oxoglutarate 4-dioxygenase (proline 4-hydroxylase), beta polypeptide (protein disulfide isomerase; thyroid hormone binding protein p55)
183   INA       0.04         0.04         0.16         0.000        0.000        0.751        internexin neuronal intermediate filament protein, alpha
229   TUBB      0.14         0.27         0.83         0.000        0.000        0.023        tubulin, beta polypeptide
289   KRT15     0.35         0.29         0.65         0.028        [illegible]  [illegible]  keratin 15
295   ENO2      0.10         0.23         0.39         0.000        0.065        0.025        enolase 2 (gamma, neuronal)
376   SET       0.25         0.17         0.71         0.000        0.000        0.031        SET translocation (myeloid leukemia-associated)
439   CKB       0.11         0.05         0.16         0.033        0.000        0.004        creatine kinase, brain
460   ANXA1     0.43         0.42         0.59         0.014        [illegible]  [illegible]  annexin A1
476   SGT       0.15         0.19         0.33         0.000        0.000        0.241        small glutamine-rich tetratricopeptide repeat (TPR)-containing
576   YWHAE     0.40         0.38         0.82         0.000        0.001        0.697        tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, epsilon polypeptide
579   YWHAQ     0.52         0.55         0.91         0.000        0.006        0.703        tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, theta polypeptide
615   YWHAZ     0.93         1.09         1.79         0.000        0.003        0.336        tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, zeta polypeptide
656   UCHL1     0.17         0.32         0.35         0.000        0.005        0.153        ubiquitin carboxyl-terminal esterase L1 (ubiquitin thiolesterase)
855   RBP1      0.42         0.41         0.77         0.006        0.014        0.951        retinol-binding protein 1, cellular
855   CRABP2    0.25         0.33         0.53         0.000        0.017        0.037        cellular retinoic acid-binding protein 2
902   ERH       0.33         0.35         0.76         0.000        0.000        0.455        enhancer of rudimentary (Drosophila) homolog
910   S100A8    1.46         1.43         0.35         0.040        0.001        0.950        S100 calcium-binding protein A8 (calgranulin A)
934   KRT17     0.16         0.30         0.15         0.763        0.073        0.013        keratin 17
693   ALB       2.63         1.93         0.92         0.000        0.003        0.133        albumin
737   SOD2      1.17         1.22         0.54         0.013        0.001        0.836        superoxide dismutase 2, mitochondrial
789   COL15A1   0.57         0.50         0.26         0.031        0.186        0.658        collagen, type XV, alpha 1
906   S100A11   2.95         2.62         0.53         0.000        0.000        0.506        S100 calcium-binding protein A11 (calgizzarin)
924   LCP1      0.18         0.13         0.05         0.000        0.004        0.034        lymphocyte cytosolic protein 1 (L-plastin)

Figure 6. First two principal components for 52 protein spots
distinguishing between lung tumor types. Small cell lung cancer
samples are shown as squares, adenocarcinomas as circles and
squamous lung tumors as triangles.

5.2 Correlations between RNA and protein 
expression 

The availability of mRNA expression data from microarrays or
Affymetrix chips for the same samples for which we have protein
2-D gel data permits several additional types of questions to be
asked. We have thus far entertained only simple models of
protein/mRNA relationships that ask which mRNA levels are most
correlated with protein spot sizes. Figure 7 depicts such a
correlation matrix using colors rather than numerical data, since
this makes it easier to visualize the relationships. In cases for
which the identity of the protein spot is known, such
investigations can answer the question of how well mRNA levels for
a protein predict that protein's abundance. In cases of protein
spots that have not yet been identified, or identified without
high confidence, such correlations can lead to or confirm
hypothetical spot identifications. More generally, one can search
for larger groups of proteins and mRNA whose abundances are
controlled by some common mechanism.

Figure 7. Correlation matrix of 30 protein spots (columns) with
mRNA levels as measured by 200 probe-sets on Affymetrix HuFL
chips. The correlation coefficients are depicted with colors,
bright red being near-perfect correlation (r = 1) and bright green
anticorrelation (r = -1). The figure was made using the TreeView
software (rana.lbl.gov/EisenSoftware.htm).

5.3 Identification of novel lung cancer markers

We have utilized a proteomic approach to identify proteins that
commonly induce an antibody response in lung cancer. Such
identified proteins or their corresponding autoantibodies likely
have substantial utility for cancer diagnosis. There is also
evidence that autoantibodies may be present prior to clinical
diagnosis and therefore detection of autoantibodies or of
circulating antigens may have utility for screening and early
diagnosis of cancer. We have identified a battery of proteins that
induce autoantibodies that are specific for different types of
cancer. We have identified a panel of autoantibodies that are
detectable in serum of lung cancer patients at the time of
diagnosis. The availability of a database of protein expression in
lung cancer has facilitated the identification of proteins that
induce autoantibodies in addition to providing valuable
information regarding the expression pattern of such antigens in
different tumor types and cell lines. One such antigen we have
identified in lung cancer is protein PGP 9.5 (Fig. 8) (Brichory et
al., manuscript submitted) [16]. PGP 9.5 was identified as a
protein in lung cancer that induces autoantibodies as part of a
study in which sera from 64 newly diagnosed patients with lung
cancer, from 99 patients with other types of cancer and from 71
noncancer controls were analyzed for antibody-based reactivity
against lung adenocarcinoma proteins resolved by 2-D PAGE. Gels
containing separated proteins were blotted and subsequently
hybridized with individual sera from patients or controls. Unlike
in controls, autoantibodies against a protein identified by MS as
protein gene product 9.5 (PGP 9.5) were detected in sera from 9
out of 64 patients with lung cancer.

Circulating PGP 9.5 antigen was detected in sera from two
additional patients with lung cancer, without detectable PGP 9.5
autoantibodies. PGP 9.5 is a neurospecific polypeptide previously
proposed as a marker for nonsmall cell lung cancer, based on its
expression in tumor tissue. Using the A549 lung adenocarcinoma
cell line, we have demonstrated that PGP 9.5 was present at the
cell surface, as well as secreted. Thus, the findings of PGP 9.5
antigen and/or antibodies in serum of patients with lung cancer
suggest that PGP 9.5 may have utility in lung cancer screening and
diagnosis, as part of a panel of such proteins or their
corresponding antibodies, which we have identified.



6 Web pages 

The relational database for storage of sample, image, protein
information and other related data is being constructed in a
stepwise fashion. The construction of a comprehensive database to
collect all pertinent information is rather challenging and
necessitates substantial resources. Similar efforts in this area
include WebGel, a web based gel database analysis system that
contains previously quantified gel data generated from a
stand-alone quantitative gel analysis system [16]. Public WebGel
demonstration databases currently available can be found at the
web site (http://www-lecb.ncifcrf.gov/webgel, WebGel database).
The task of web based retrieval of data from the protein database
is rather complex as there are different kinds of data that may
need to be retrieved. The microarray data could be stored in the
database instead of Excel files, and the Access 2000 database that
the MS team utilizes could be transferred to the database. Tables
are being built to eliminate any handwritten collection of data.
Developing a database is hard because of the complexity and very
large amount of unstructured data generated. There are conflicting
pressures between "using what we've got already" and constructing
something better. Sometimes there is a natural break in the data,
such as when a shift is made from one platform type to another.
Then one could "pile up" old data and organize it neatly. On the
other hand, when new technologies are introduced, they require new
ways of storing the data. The lung protein database is
continuously evolving to enhance the relational schema to be more
flexible and comprehensive and to make data processing more robust
and automatic.

Figure 8. 2-D PAGE and Western blot analysis of A549 lung
adenocarcinoma cell proteins. Panel 1 shows A549 2-D protein
pattern after silver staining. The boxed area is shown in panel 2,
in which arrows point to the location of PGP 9.5 forms (spots P1
to P3) recognized by sera from patients with lung cancer and the
position of the form P4 recognized by a polyclonal rabbit anti-PGP
9.5 antiserum, which also recognizes P1-P3. Panel 2 shows
close-ups of western blots hybridized with two different sera from
patients with lung adenocarcinoma that showed reactivity against
PGP 9.5 proteins.

The lung protein database is a backbone to record proteome data
for many different studies and to mine the existing data for new
discoveries. The new generation LIPS provides investigators with
web-enabled interfaces to the laboratory databases and 2-D images
with internet access. There is certainly a need for sharing
information in the database on a global basis. We have used
internet and WWW technologies to provide a distributed process
with an easy-to-use front-end user interface. Figure 9 shows a top
level view of a web-based process for performing our studies from
a data processing perspective. Some of our web pages were
developed in the Visual InterDev and ASP development environment
on Microsoft and some were developed in the Oracle 8i and WebDB
web application environment on Solaris. As an example, the MS data
web page is shown in Fig. 10. Detailed "how-to" documentation is
provided as on-line help for recently extended capabilities of
LIPS.




Figure 9. Web-based process of using lung protein data- 
base. 



7 Conclusion 

The value of the database we have constructed depends to a large
measure on its content, the quality of data and the ease with
which data can be retrieved and analyzed. While the amount of data
generated is already quite sizeable, it is likely that the
database will continue to undergo substantial expansion. Proteins
are being identified at a rapid pace, thus enhancing our ability
to link protein expression data with RNA based expression data for
corresponding genes. As such, the database will play an important
role in achieving our objective of developing novel classification
schemes for lung cancer and the identification of novel markers
for early diagnosis. The database will also serve as a useful
resource for other investigations of lung biology and of diseases
other than lung cancer.

Received May 20, 2001 